{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "scrolled": false
   },
   "source": [
    "# Implementing PPO-clipped method\n",
    "\n",
    "Let's implement the PPO-clipped method for swinging up the pendulum task. The code\n",
    "used in this section is adapted from one of the very good PPO implementations (https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/12_Proximal_Policy_Optimization) by Morvan. \n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import gym"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the gym environment\n",
    "\n",
    "Let's create a pendulum environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('Pendulum-v0').unwrapped"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the state shape of the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "state_shape = env.observation_space.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the action shape of the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "action_shape = env.action_space.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the pendulum is a continuous environment and thus our action space consists of\n",
    "continuous values. So, we get the bound of our action space:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "action_bound = [env.action_space.low, env.action_space.high]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the epsilon value which is used in the clipped objective:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "epsilon = 0.2 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the PPO class\n",
    "\n",
    "Let's define the class called PPO where we will implement the PPO algorithm.  For a clear understanding, you can also check the detailed explanation of code on the book."
   ]
  },
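  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder of what the class optimizes, the clipped surrogate objective can be written as follows. This is a sketch in standard PPO notation; the symbols $r_t(\\theta)$ and $A_t$ are introduced here for illustration and correspond to `ratio` and `advantage_ph` in the code, and $\\epsilon$ is the `epsilon` value set above:\n",
    "\n",
    "$$r_t(\\theta) = \\frac{\\pi_{\\theta}(a_t|s_t)}{\\pi_{\\theta_{old}}(a_t|s_t)}$$\n",
    "\n",
    "$$L(\\theta) = \\mathbb{E}_t\\left[\\min\\left(r_t(\\theta) A_t,\\ \\text{clip}\\left(r_t(\\theta), 1-\\epsilon, 1+\\epsilon\\right) A_t\\right)\\right]$$\n",
    "\n",
    "With $\\epsilon = 0.2$, the ratio is clipped to the range $[0.8, 1.2]$, and the policy loss minimized in the code is $-L(\\theta)$."
   ]
  },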
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PPO(object):\n",
    "    #first, let's define the init method\n",
    "    def __init__(self):\n",
    "        \n",
    "        #start the TensorFlow session\n",
    "        self.sess = tf.Session()\n",
    "        \n",
    "        #define the placeholder for the state\n",
    "        self.state_ph = tf.placeholder(tf.float32, [None, state_shape], 'state')\n",
    "\n",
    "        #now, let's build the value network which returns the value of a state\n",
    "        with tf.variable_scope('value'):\n",
    "            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu)\n",
    "            self.v = tf.layers.dense(layer1, 1)\n",
    "            \n",
    "            #define the placeholder for the Q value\n",
    "            self.Q = tf.placeholder(tf.float32, [None, 1], 'discounted_r')\n",
    "            \n",
    "            #define the advantage value as the difference between the Q value and state value\n",
    "            self.advantage = self.Q - self.v\n",
    "\n",
    "            #compute the loss of the value network\n",
    "            self.value_loss = tf.reduce_mean(tf.square(self.advantage))\n",
    "            \n",
    "            #train the value network by minimizing the loss using Adam optimizer\n",
    "            self.train_value_nw = tf.train.AdamOptimizer(0.002).minimize(self.value_loss)\n",
    "\n",
    "        #now, we obtain the policy and its parameter from the policy network\n",
    "        pi, pi_params = self.build_policy_network('pi', trainable=True)\n",
    "\n",
    "        #obtain the old policy and its parameter from the policy network\n",
    "        oldpi, oldpi_params = self.build_policy_network('oldpi', trainable=False)\n",
    "        \n",
    "        #sample an action from the new policy\n",
    "        with tf.variable_scope('sample_action'):\n",
    "            self.sample_op = tf.squeeze(pi.sample(1), axis=0)       \n",
    "\n",
    "        #update the parameters of the old policy\n",
    "        with tf.variable_scope('update_oldpi'):\n",
    "            self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]\n",
    "\n",
    "        #define the placeholder for the action\n",
    "        self.action_ph = tf.placeholder(tf.float32, [None, action_shape], 'action')\n",
    "        \n",
    "        #define the placeholder for the advantage\n",
    "        self.advantage_ph = tf.placeholder(tf.float32, [None, 1], 'advantage')\n",
    "\n",
    "        #now, let's define our surrogate objective function of the policy network\n",
    "        with tf.variable_scope('loss'):\n",
    "            with tf.variable_scope('surrogate'):\n",
    "                \n",
    "                #first, let's define the ratio \n",
    "                ratio = pi.prob(self.action_ph) / oldpi.prob(self.action_ph)\n",
    "    \n",
    "                #define the objective by multiplying ratio and the advantage value\n",
    "                objective = ratio * self.advantage_ph\n",
    "                \n",
    "                #define the objective function with the clipped and unclipped objective:\n",
    "                L = tf.reduce_mean(tf.minimum(objective, \n",
    "                                   tf.clip_by_value(ratio, 1.-epsilon, 1.+ epsilon)*self.advantage_ph))\n",
    "                \n",
    "            \n",
    "            #now, we can compute the gradient and maximize the objective function using gradient\n",
    "            #ascent. However, instead of doing that, we can convert the above maximization objective\n",
    "            #into the minimization objective by just adding a negative sign. So, we can denote the loss of\n",
    "            #the policy network as:\n",
    "            \n",
    "            self.policy_loss = -L\n",
    "    \n",
    "        #train the policy network by minimizing the loss using Adam optimizer:\n",
    "        with tf.variable_scope('train_policy'):\n",
    "            self.train_policy_nw = tf.train.AdamOptimizer(0.001).minimize(self.policy_loss)\n",
    "        \n",
    "        #initialize all the TensorFlow variables\n",
    "        self.sess.run(tf.global_variables_initializer())\n",
    "\n",
    "    #now, let's define the train function\n",
    "    def train(self, state, action, reward):\n",
    "        \n",
    "        #update the old policy\n",
    "        self.sess.run(self.update_oldpi_op)\n",
    "        \n",
    "        #compute the advantage value\n",
    "        adv = self.sess.run(self.advantage, {self.state_ph: state, self.Q: reward})\n",
    "            \n",
    "        #train the policy network\n",
    "        [self.sess.run(self.train_policy_nw, {self.state_ph: state, self.action_ph: action, self.advantage_ph: adv}) for _ in range(10)]\n",
    "        \n",
    "        #train the value network\n",
    "        [self.sess.run(self.train_value_nw, {self.state_ph: state, self.Q: reward}) for _ in range(10)]\n",
    "\n",
    "    \n",
    "    #we define a function called build_policy_network for building the policy network. Note\n",
    "    #that our action space is continuous here, so our policy network returns the mean and\n",
    "    #variance of the action as an output and then we generate a normal distribution using this\n",
    "    #mean and variance and we select an action by sampling from this normal distribution\n",
    "\n",
    "    def build_policy_network(self, name, trainable):\n",
    "        with tf.variable_scope(name):\n",
    "            \n",
    "            #define the layer of the network\n",
    "            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu, trainable=trainable)\n",
    "            \n",
    "            #compute mean\n",
    "            mu = 2 * tf.layers.dense(layer1, action_shape, tf.nn.tanh, trainable=trainable)\n",
    "            \n",
    "            #compute standard deviation\n",
    "            sigma = tf.layers.dense(layer1, action_shape, tf.nn.softplus, trainable=trainable)\n",
    "            \n",
    "            #compute the normal distribution\n",
    "            norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)\n",
    "            \n",
    "        #get the parameters of the policy network\n",
    "        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)\n",
    "        return norm_dist, params\n",
    "\n",
    "    #let's define a function called select_action for selecting the action\n",
    "    def select_action(self, state):\n",
    "        state = state[np.newaxis, :]\n",
    "        \n",
    "        #sample an action from the normal distribution generated by the policy network\n",
    "        action = self.sess.run(self.sample_op, {self.state_ph: state})[0]\n",
    "        \n",
    "        #we clip the action so that they lie within the action bound and then we return the action\n",
    "        action =  np.clip(action, action_bound[0], action_bound[1])\n",
    "\n",
    "        return action\n",
    "\n",
    "    #we define a function called get_state_value to obtain the value of the state computed by the value network\n",
    "    def get_state_value(self, state):\n",
    "        if state.ndim < 2: state = state[np.newaxis, :]\n",
    "        return self.sess.run(self.v, {self.state_ph: state})[0, 0]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the network\n",
    "\n",
    "Now, let's start training the network. First, let's create an object to our PPO class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppo = PPO()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the number of episodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the number of time steps in each episode:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_timesteps = 200"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the discount factor, $\\gamma$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "gamma = 0.9"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the batch size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Episode:0, Return: -1597.0752474266517\n"
     ]
    }
   ],
   "source": [
    "#for each episode\n",
    "for i in range(num_episodes):\n",
    "    \n",
    "    #initialize the state by resetting the environment\n",
    "    state = env.reset()\n",
    "    \n",
    "    #initialize the lists for holding the states, actions, and rewards obtained in the episode\n",
    "    episode_states, episode_actions, episode_rewards = [], [], []\n",
    "    \n",
    "    #initialize the return\n",
    "    Return = 0\n",
    "    \n",
    "    #for every step\n",
    "    for t in range(num_timesteps):   \n",
    "        \n",
    "        #render the environment\n",
    "        env.render()\n",
    "        \n",
    "        #select the action\n",
    "        action = ppo.select_action(state)\n",
    "        \n",
    "        #perform the selected action\n",
    "        next_state, reward, done, _ = env.step(action)\n",
    "        \n",
    "        #store the state, action, and reward in the list\n",
    "        episode_states.append(state)\n",
    "        episode_actions.append(action)\n",
    "        episode_rewards.append((reward+8)/8)    \n",
    "        \n",
    "        #update the state to the next state\n",
    "        state = next_state\n",
    "        \n",
    "        #update the return\n",
    "        Return += reward\n",
    "        \n",
    "        #if we reached the batch size or if we reached the final step of the episode\n",
    "        if (t+1) % batch_size == 0 or t == num_timesteps-1:\n",
    "            \n",
    "            #compute the value of the next state\n",
    "            v_s_ = ppo.get_state_value(next_state)\n",
    "            \n",
    "            #compute Q value as sum of reward and discounted value of next state\n",
    "            discounted_r = []\n",
    "            for reward in episode_rewards[::-1]:\n",
    "                v_s_ = reward + gamma * v_s_\n",
    "                discounted_r.append(v_s_)\n",
    "            discounted_r.reverse()\n",
    "    \n",
    "            #stack the episode states, actions, and rewards:\n",
    "            es, ea, er = np.vstack(episode_states), np.vstack(episode_actions), np.array(discounted_r)[:, np.newaxis]\n",
    "            \n",
    "            #empty the lists\n",
    "            episode_states, episode_actions, episode_rewards = [], [], []\n",
    "            \n",
    "            #train the network\n",
    "            ppo.train(es, ea, er)\n",
    "        \n",
    "    #print the return for every 10 episodes\n",
    "    if i %10 ==0:\n",
    "         print(\"Episode:{}, Return: {}\".format(i,Return))  "
   ]
  },
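  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The inner loop that builds `discounted_r` in the training code above computes a bootstrapped discounted return for each step in the batch. As a sketch, write $\\hat{Q}_t$ for the target at step $t$ and $T$ for the number of steps in the batch (both symbols are introduced here for illustration), and let $V(s_T)$ be the value-network estimate for the state reached after the last step:\n",
    "\n",
    "$$\\hat{Q}_T = V(s_T), \\qquad \\hat{Q}_t = r_t + \\gamma \\hat{Q}_{t+1} \\quad \\text{for } t = T-1, \\dots, 0$$\n",
    "\n",
    "Here $r_t$ is the scaled reward stored in `episode_rewards`, and the advantage fed to the policy update is $\\hat{Q}_t - V(s_t)$, which is what `self.advantage` evaluates."
   ]
  },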
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we learned how PPO with clipped objective works and how to implement them, in the next section we will learn another interesting type of PPO algorithm called PPO with\n",
    "the penalized objective."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
